Increased computer peripheral throughput by using data available withholding

ABSTRACT

A method and apparatus for a multiprocessor system to simultaneously process multiple data write commands issued from one or more peripheral component interface (PCI) devices by controlling and limiting notification of invalidated address information issued by one memory controller managing one group of multiprocessors in a plurality of multiprocessor groups. The method and apparatus permit a multiprocessor system to almost completely process a subsequently issued write command from a PCI device or other type of computer peripheral device before a previous write command has been completely processed by the system. The disclosure is particularly applicable to multiprocessor computer systems which utilize non-uniform memory access (NUMA).

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] The following patent applications, all assigned to the assignee of this application, describe related aspects of the arrangement and operation of multiprocessor computer systems according to this invention or its preferred embodiment.

[0002] U.S. patent application Ser. No. ______ by T. B. Berg et al. (BEA920000017US1) entitled “Method And Apparatus For Using Global Snooping To Provide Cache Coherence To Distributed Computer Nodes In A Single Coherent System” was filed on Jan. ______, 2002.

[0003] U.S. patent application Ser. No. ______ by S. G. Lloyd et al. (BEA920000019US1) entitled “Transaction Redirection Mechanism For Handling Late Specification Changes And Design Errors” was filed on Jan. ______, 2002.

[0004] U.S. patent application Ser. No. ______ by T. B. Berg et al. (BEA920000020US1) entitled “Method And Apparatus For Multi-path Data Storage And Retrieval” was filed on Jan. ______, 2002.

[0005] U.S. patent application Ser. No. ______ by W. A. Downer et al. (BEA920000021US1) entitled “Hardware Support For Partitioning A Multiprocessor System To Allow Distinct Operating Systems” was filed on Jan. ______, 2002.

[0006] U.S. patent application Ser. No. ______ by T. B. Berg et al. (BEA920000022US1) entitled “Distributed Allocation Of System Hardware Resources For Multiprocessor Systems” was filed on Jan. ______, 2002.

[0007] U.S. patent application Ser. No. ______ by W. A. Downer et al. (BEA920010030US1) entitled “Masterless Building Block Binding To Partitions” was filed on Jan. ______, 2002.

[0008] U.S. patent application Ser. No. ______ by W. A. Downer et al. (BEA920010031US1) entitled “Building Block Removal From Partitions” was filed on Jan. ______, 2002.

[0009] U.S. patent application Ser. No. ______ by W. A. Downer et al. (BEA920010041US1) entitled “Masterless Building Block Binding To Partitions Using Identifiers And Indicators” was filed on Jan. ______, 2002.

BACKGROUND OF THE INVENTION

[0010] 1. Technical Field

[0011] The present invention relates generally to computer data cache schemes, and more particularly to a method and apparatus for simultaneously processing a series of data writes from a standard peripheral computer interface device when having multiple data processors in a system utilizing non-uniform memory access.

[0012] 2. Description of the Related Art

[0013] In computer system designs utilizing more than one processor operating simultaneously in a coordinated manner, data handling from peripheral component interface (PCI) devices is controlled in a fashion that provides only for single transactions to be processed at one time or in strict order, if multiple data output commands are received from one of the PCI devices in a system utilizing any number of such devices. In a multiprocessor system which uses non-uniform memory access, where system memory may be distributed across multiple memory controllers in a single system, this may limit performance.

[0014] A PCI device, such as a hard disk controller, may issue a write command. Any multiple processor address control system will send an “invalidate” indication of the data line to be written to all caching agents or processors. One method of handling such invalidates in the past is that a controller waits to receive acknowledgments that the data invalidate has been received and then makes that data line available for writing. The controller then sends an invalidate of a flag line for that data line, which was just made available for write. In the prior art, many such controllers will wait to receive acknowledgments from all memory sources prior to proceeding and then will accept the data from the PCI device attempting to write to memory. After such a device writes to the memory management device, that device makes the flag line available. Usually, controllers found in the prior art post write commands only in the same order as the invalidate commands are issued on a particular PCI bus.

[0015] All of this has the effect of slowing down system speed and therefore performance, because of component latency and because the ability of the system to process multiple data lines while waiting for invalidate indicators from other system processors is not fully utilized.

SUMMARY OF THE INVENTION

[0016] A first aspect of this invention is a method for controlling the sequencing of data writes from peripheral devices in a multiprocessor computer system. The computer system includes groups of processors, with each processor group interconnected to the other processor groups. In the method, a first data write is issued by a peripheral device in the system, queued, and checked for completion. The sequence order of overlapping write data is tracked. Both the first and the second write data are processed substantially simultaneously using one or more of the memory systems, but the processed second write data is output only after completion of the first data write. By starting the processing of subsequent data writes before completing previous data writes, the method of the invention increases overall performance of the system.

[0017] Another aspect of this invention is found in a multiprocessor computer system itself. The system has two or more groups of one or more processors each. The system also has a peripheral device capable of initiating first and second data writes producing first and second write data, respectively, and a queue capable of sequentially ordering the data writes. A completion indicator determines completion of the first data write, and a sequencer tracks overlapping of the write data, both in response at least in part to the write data. The system includes storage for the first and second write data, and output for the first and second write data which responds at least in part to the sequencer and the completion indicator. The storage for the second write data is capable of accepting the second write data before completion of the first data write, but the output for the second write data is capable of outputting the second write data only after completion of the first data write.

[0018] Other features and advantages of this invention will become apparent from the following detailed description of the presently preferred embodiment of the invention, taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0019] FIG. 1 is a functional block diagram of the inbound data block which buffers data from the PCI bus according to the preferred embodiment of the invention.

[0020] FIG. 2 is a functional block diagram of the inbound data order queue of FIG. 1, and is suggested for printing on the first page of the issued patent.

[0021] FIG. 3 is a functional block diagram of the inbound data completion arbiter of FIG. 1.

[0022] FIGS. 4A-4C together form a logic timing diagram illustrating the inbound data timing sequence during operation of the preferred embodiment.

[0023] FIG. 5 is a block diagram of a multiprocessor system having a tag and address crossbar and a data crossbar, and incorporating the inbound data block of FIG. 1.

[0024] FIG. 6 is a block diagram of one quad processor group of the system of FIG. 5.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Overview

[0025] The preferred embodiment of this invention allows data being issued from peripheral component interface (PCI) devices or other computer peripheral components to be almost simultaneously processed in a parallel fashion without distortion of the data transaction timing sequence in a multiple microprocessor computer. A PCI device can write two cache data lines, the first cache line being called “data” and the second line being called a “flag”. A memory control device for a group of processors receives these two cache line write transactions from a PCI bridge device interfacing between the PCI peripheral and the control system. The control system issues a write command, but the control system does not process subsequent transactions at that time.

[0026] The control system looks up the state of the data cache line and issues invalidates to any processors in the system that hold a copy of that data line. At the same time the control system signals other similar control systems associated with all other groups of processors that the invalidates have been issued. The various control systems, each associated with a group of processors, invalidate the data line in the processors' caches on that particular group of processors. Such controllers send an “acknowledge” command back to the original controller indicating that they have invalidated the proper line in response to the original controller's request.

[0027] Overall performance of the computer system is improved because the system handles the data for the “data” cache line but does not make that data line visible to the rest of the system controllers. As soon as the “data” has been moved into the central memory cache system, the controller signals a particular PCI bridge chip that it has completed the write and has deallocated the buffer space so that it can receive more transactions. Once the originating controller has received the indication from the central controller (comprising a tag and address crossbar) that the invalidates will eventually be issued, the “write flag” transaction can proceed with the above steps of issuing the “write flag” command to the central controller, having the central controller look up the states and issue invalidates, and having the other controllers invalidate their processors and send “acknowledge” commands to the original controller. The originating controller will receive acknowledges for both the “data” and “flag” lines.

[0028] Acknowledges of invalidates can thus be received in any order. For example, the acknowledges for the flag may be received before or after the acknowledges for the data without corruption of the ordering sequence. The invention ensures that the “data” line is made visible to the rest of the system only after all acknowledges for the “data” line have been received. The invention also ensures that the “flag” line is made visible only after both the “data” has been made visible and all acknowledges for the “flag” have been received.
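
The visibility rule of the preceding paragraph can be sketched in a few lines of Python. This is an illustrative model only, not the hardware of the preferred embodiment; the names LineWrite and update_visibility and the acknowledge counts are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class LineWrite:
    """Hypothetical record for one posted cache-line write."""
    name: str
    acks_expected: int
    acks_received: int = 0
    visible: bool = False

    def all_acked(self) -> bool:
        return self.acks_received >= self.acks_expected


def update_visibility(writes):
    """Make writes visible strictly in issue order.

    A write becomes visible only when all of its acknowledges are in
    and every earlier write in the list is already visible.
    """
    for w in writes:
        if w.visible:
            continue
        if not w.all_acked():
            break  # later writes must wait behind this one
        w.visible = True
    return [w.name for w in writes if w.visible]


data = LineWrite("data", acks_expected=2)
flag = LineWrite("flag", acks_expected=1)
writes = [data, flag]

flag.acks_received = 1             # flag acknowledges arrive first
early = update_visibility(writes)  # flag is held back behind data

data.acks_received = 2             # data acknowledges arrive later
late = update_visibility(writes)   # data becomes visible, then flag
```

In this sketch the flag line stays hidden even though its acknowledges arrived first, and both lines become visible in issue order once the data line is fully acknowledged.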

Technical Details

[0029] The present invention relates specifically to an improved data handling method for use in a multiple processor system which utilizes a tagging and address crossbar system for use in combination with a data crossbar system, together comprising a data processing system to process multiple data write requests concurrently while maintaining the correct order of the write requests. The system maintains transaction ordering from a given PCI Bus throughout an entire system which employs multiple processors all of which may have access to a particular PCI Bus. The described invention is particularly useful in non-uniform memory access (NUMA) multi-processor systems. NUMA systems partition physical memory associated with one local group of microprocessors into locally available memory (home memory) and remote memory or cache for use by processors in other processor groups within a system. In such systems, apparatus which coordinates peripheral component interface (PCI) devices to control ordering rules for processing write commands must coordinate data transfer to prevent overwriting or out-of-sequence data forwarding. Namely, a series of write commands from a PCI device must be made visible to the system in the precise order that they were issued by the PCI device. Some tag and address crossbar systems used in multiple processor systems cannot allow a line of data to be made visible to the system until that line has been invalidated (i.e., an “invalidate” has been issued), thus ensuring that no processor in the system has access to that line. This has the effect of limiting the processing speed of the system.

[0030] FIG. 5 presents an example of a typical multiprocessor system in which the present invention may be used. FIG. 5 illustrates a multi-processor system which utilizes four separate central control systems (control agents) 66, each of which provides input/output interfacing and memory control for an array 64 of four Intel brand Itanium class microprocessors 62 per control agent 66. In many applications, control agent 66 is an application specific integrated circuit (ASIC) which is developed for a particular system application to provide the interfacing for each microprocessor bus 76, each memory 68 associated with a given control agent 66, F16 bus 21, and PCI input/output interface 80, along with the associated PCI bus 74 which connects to various PCI devices.

[0031] FIG. 5 also illustrates the port connection between each tag and address crossbar 70 as well as data crossbar 72. As can be appreciated from the block diagram shown in FIG. 5, crossbar 70 and crossbar 72 allow communications between each control agent 66, such that addressing information and data information can be communicated across the entire multiprocessor system 60. Such a memory addressing system is necessary to communicate data locations across the system and facilitate update of control agent 66 cache information regarding data validity and required data location.

[0032] A single quad processor group 58 is comprised of microprocessors 62, memory 68, and control agent 66. In multiprocessor systems to which the present invention relates, quad memory 68 is usually random access memory (RAM) available to the local control agent 66 as local or home memory. A particular memory 68 is attached to a particular control agent 66 in the entire system 60, but is considered remote memory when accessed by another quad or control agent 66 not directly connected to that particular memory 68. A microprocessor 62 existing in any one quad processor group 58 may access memory 68 on any other quad processor group 58. NUMA systems typically partition memory 68 into local memory and remote memory for access by other quad processor groups 58. The present invention enhances the entire system's ability to keep track of data writes issued by PCI devices that access memory 68 which is located in a processor group 58 different from, and therefore remote from, a processor group 58 which has a PCI device which issued the write.

[0033] The present invention permits a system using multiple processors with a processor group interface control system and an address tag and crossbar system to almost completely process subsequent PCI device issued writes before previous writes from such PCI devices have been completely processed by the system. The present invention provides for a method in which invalidates for a second write can be issued in parallel with the invalidates for a first write being issued by a given PCI device. The invention ensures that the second write is not visible to the system, that is, that no processor on a multiprocessor board can read the data from the second write, until the invalidates from the first write have been received. In the preferred embodiment, multiple writes issued sequentially by a PCI device may be processed in parallel and out of sequence with the order in which the writes were issued while ensuring that the output sequence of the writes remains identical to the order of the input writes. FIG. 6 is a different, simpler view of the same multiprocessor system shown in FIG. 5, illustrating one processor group 58, sometimes referred to as a Quad, in relation to the other Quad 58 components as well as crossbar systems 70 and 72, illustrated for simplicity as one unit in FIG. 6.

[0034] Turning now to FIG. 1, the invention will be described with reference to the functional block diagram for the inbound data block. Inbound data block (IDB) 10 interfaces with data organizer (DOG) 12, request completion manager (RCM) 14, transaction manager (X-MAN) 16, inbound command block (ICB) 18, and the F16 interface block 20. IDB 10 buffers the data from F16 bus 21 to the DOG 12. IDB 10 is also responsible for maintaining the data ordering to allow the posting of inbound transactions from the PCI bus 74 being generated by a PCI device and delivered from each PCI input-output interface 80. F16 interface block 20 represents an initial interface inside control agent 66 to the group of four F16 buses 21. Block 20, which is associated with each F16 bus 21, pushes data into the inbound data queue (IDQ) 22 as it arrives from an individual F16 bus 21. In the preferred embodiment of the present invention, F16 bus 21 is a proprietary design utilized by the Intel brand of microprocessors and associated microprocessor chip sets which is known commonly as the Intel F16 bus. As can be seen in FIG. 5, F16 bus 21 acts as a bridging bus between control agent 66 and PCI input/output (IO) interface 80 which can connect a particular PCI device to the system. The invention presently described may be applied to other types of data interfaces which issue data sequentially for use by a processor system. Though the preferred embodiment is described utilizing an Intel brand PCI bridge chip set, it should be appreciated that other device interfaces utilizing other component bus interface systems which interconnect system devices, such as disk drives, video cards or other peripheral components, may be utilized in carrying out the system described herein. Each individual quad control agent 66 utilizes four such PCI buses 74 connected through PCI bridge chips 80 which are in turn connected to the F16 interface block 20 contained within each individual control agent 66 via the F16 bus 21. The four F16 buses 21 operate in a parallel fashion and simultaneously as illustrated in FIG. 5.

[0035] Each microprocessor 62 communicates to the rest of the system through its individual processor bus 76, which communicates with its respective control agent 66, especially in the preferred embodiment wherein each group 64 of four Itanium microprocessors 62 is also connected in a larger array comprised of four quads 58 of like processor arrays as depicted in FIG. 5. The Itanium microprocessor used in the preferred embodiment is a device manufactured by Intel.

[0036] RCM 14 acts as the control responsible for tracking the progress of all incoming transactions initiated by the PCI bus 74 and scheduling and dispatching the appropriate data responses. RCM 14 controls the data sequencing and steering through DOG 12 and streamlines the data flow through the data crossbar system used to connect multiple processor groups in either a single quad or multiple quad processor system configurations. DOG 12 provides data storage and routing between each of the major interface modules shown in FIG. 1. The heart of DOG 12 is a data buffer in which up to 64 cache lines of data can be stored. Surrounding the storage area is logic hardware that provides for no-wait writes into and low latency reads out of the data buffer within DOG 12.

[0037] Continuing with FIG. 1, when all inbound data from the F16 interface block 20 has been moved, the inbound command block 18 also schedules the transaction in the inbound data order queue (IDOQ) 24. IDOQ 24 makes the transaction available to the inbound data handler (IDH) 26 for data movement. IDOQ 24 is responsible for keeping the proper order of all data flowing from a given F16 bus 21 to a given control agent 66. Specifically, IDOQ 24 tracks and maintains the order of all inbound writes to memory 68, inbound responses to outbound reads when processor 62 reads data on a PCI device, and inbound interrupts in the system. In system 60, the control agents 66 must track the order of all inbound data, whether a PCI device write or other data. IDOQ 24 keeps track of such data to maintain information regarding the order of any data. IDH 26 schedules the transaction and moves the data from IDQ 22 to DOG 12. After the data has moved, IDOQ 24 is notified and the resources used by the transaction are freed to either the transaction manager 16 or the outbound command block (not shown).

[0038] If the transaction being handled was a cacheable write, then RCM 14 will signal that all of the invalidations have been collected. Once all of the ordering requirements have been met, the inbound data complete arbiter 30 will signal RCM 14 that the transaction is complete and the data is available to DOG 12. If the transaction being processed is a cacheable partial write, then IDH 26 will signal DOG 12 to place the transaction in a cacheable partial write queue where it awaits the background data.

[0039] Turning now to FIG. 2, Inbound Data Order Queue 24 is presented in operational terms. IDOQ 24 has four main interfaces as shown in FIG. 2. Those interfaces are the Inbound Command Block (ICB) 18, the Response Completion Manager (RCM) 14, the Inbound Data Completion Arbiter (IDCA) 30 and the Inbound Data Handler (IDH) 26.

[0040] When a write operation is issued from F16 bus 21, it enters ICB 18 through F16 interface block 20. ICB 18 presents the write request to transaction manager 16, which is located within control agent 66 and routes the write request to a bus within control agent 66 connected to tag and address crossbar 70, as more fully illustrated in FIG. 5. Tag and address crossbar 70 receives the write operation, looks up tag information for the corresponding address of the data and determines which agent 66 in quad processor system 60 must receive an invalidate operation. As tag and address crossbar 70 issues an invalidate to other control agents 66, tag and address crossbar 70 also issues a reply to the requesting agent 66 which indicates how many acknowledgments it should receive. Further, tag and address crossbar 70 also signals the requesting control agent 66 if it must also issue an invalidate operation on a processor bus 76 connected to that control agent 66. In the event that the affected control agent 66 must issue an invalidate operation, such operation is treated like any other invalidate or acknowledge pair command. Once tag and address crossbar 70 provides a reply as indicated above, the transaction involved is moved from ICB 18 to IDOQ 24. At this point, IDOQ 24 begins to track the order and location of the particular data being read from the PCI bus 21.
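
The acknowledgment count carried in the crossbar's reply lends itself to a simple counter. The following Python sketch shows one way a requesting agent might track outstanding acknowledgments per transaction; the class AckCollector and its method names are hypothetical and are not taken from the described hardware.

```python
class AckCollector:
    """Hypothetical bookkeeping of outstanding invalidate acknowledgments."""

    def __init__(self):
        self.outstanding = {}  # TrID -> acknowledgments still expected

    def on_crossbar_reply(self, trid, expected_acks):
        """Record the acknowledgment count carried in the crossbar reply."""
        self.outstanding[trid] = expected_acks
        return self.is_complete(trid)

    def on_acknowledge(self, trid):
        """Count one acknowledgment; return True when none remain."""
        self.outstanding[trid] -= 1
        return self.is_complete(trid)

    def is_complete(self, trid):
        return self.outstanding[trid] == 0


collector = AckCollector()
collector.on_crossbar_reply(trid=5, expected_acks=3)
progress = [collector.on_acknowledge(5) for _ in range(3)]
```

Only the third acknowledgment completes the transaction in this example, which is the point at which an ACK indication for that TrID could be raised.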

[0041] In the event of an outbound read, ICB 18 receives a data response from F16 bus 21 through F16 interface block 20 for a previously issued outbound read (for example, if a processor is reading from a PCI card), and ICB 18 does not give the command to transaction manager 16. Instead, ICB 18 forwards the command directly into IDOQ 24 as more fully described below.

[0042] When tag and address crossbar 70 has given a reply to a new inbound data operation, ICB 18 asserts a valid signal along with other information regarding that particular new inbound data. Such additional information which is tracked includes the transaction identification (TrID) for the transaction; whether the transaction is an inbound write request or a response to an outbound read or an interrupt; whether the data is a full cache line or just a partial line of data; and, if the operation involves a partial line of data, the part of the cache line with which the data is associated.

[0043] As ICB 18 forwards the operation to IDOQ 24, as shown in FIG. 1, other logic is moving corresponding data into IDQ 22. IDQ 22 functions as a true first-in/first-out (FIFO), so that the order of TrIDs given from ICB 18 to IDOQ 24 must match the order of data loaded into IDQ 22.

[0044] When IDOQ 24 receives a new operation from ICB 18, such information as described above is loaded into a register within IDOQ 24, shown more particularly in FIG. 2 as CAM TrID register 84. In the preferred embodiment, there are eight queue locations in the register. The register location which is loaded is the location pointed to by write pointer 86 in FIG. 2. For example, if pointer 86 has a value of 3, then IDOQ 24's register 3 is loaded. Once such an operation is written to the register, write pointer 86 is incremented so that the next operation would go to the next queue location in turn. If write pointer 86 has a value of 7 and is then incremented, it rolls over and begins at zero again. It will be appreciated by those skilled in the art that using pointers in this manner is a known method for implementing queues in a variety of different queuing operations or procedures.
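
The wrap-around behavior of the write pointer can be illustrated with a short Python sketch. The class name OrderQueueRegisters and the sample TrID values are assumptions for the example, and the check against a full queue is deliberately omitted here.

```python
QUEUE_DEPTH = 8  # eight queue locations, as in the preferred embodiment


class OrderQueueRegisters:
    """Hypothetical model of the register file and its write pointer."""

    def __init__(self):
        self.entries = [None] * QUEUE_DEPTH
        self.write_ptr = 0

    def load(self, trid):
        """Store trid at the write pointer, then advance with roll-over."""
        slot = self.write_ptr
        self.entries[slot] = trid
        # Incrementing past location 7 rolls over to location 0.
        self.write_ptr = (self.write_ptr + 1) % QUEUE_DEPTH
        return slot


regs = OrderQueueRegisters()
# Nine loads: the ninth lands in location 0 again. Real hardware would
# first require that location to be free; that check is not modeled.
slots = [regs.load(trid) for trid in range(100, 109)]
```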

[0045] When all of the acknowledgments have been received for a particular write operation, RCM 14 asserts an acknowledge (ACK) signal and provides a corresponding transaction identification (TrID), shown at 88 in FIG. 1. IDOQ 24 recognizes the ACK signal provided by RCM 14 and then simultaneously compares the ACK TrID 88 to each TrID in the registers within the IDOQ 24. The logic in IDOQ 24 then sets the corresponding ACK DONE data bit within the register within IDOQ 24 that contains the TrID that matches the information in the ACK TrID 88.
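
The content-addressable comparison described above can be mimicked in software by checking every register against the acknowledged TrID. The sketch below is a sequential stand-in for what the hardware does in parallel; the function name apply_ack and the dictionary layout of a register are assumptions.

```python
def apply_ack(registers, ack_trid):
    """Set ack_done on every valid register whose TrID matches ack_trid.

    Hardware compares all registers simultaneously; this loop is a
    sequential stand-in that produces the same result.
    """
    hits = 0
    for reg in registers:
        if reg is not None and reg["trid"] == ack_trid:
            reg["ack_done"] = True
            hits += 1
    return hits


registers = [
    {"trid": 0x11, "ack_done": False},
    {"trid": 0x12, "ack_done": False},
    None,  # an empty queue location
]
hits = apply_ack(registers, 0x12)
```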

[0046] Continuing with FIG. 2, the functional interface between IDOQ 24 and Inbound Data Handler (IDH) 26 will be described. IDOQ 24 supplies information to IDH 26 corresponding to the next data line to be moved from the IDQ 22 to the DOG 12. IDOQ 24 passes on such information it receives from ICB 18 for a given transaction, specifically those items described above regarding pertinent information about a valid transaction asserted by ICB 18. IDH 26 does not differentiate whether such an operation is either a read or a write. Using the information described above regarding the data, IDH 26 controls a transfer of data from IDQ 22 to DOG 12 appropriately.

[0047] In accordance with the description above, IDOQ 24 must provide the TrIDs to IDH 26 in the same order as such transaction identifications were received from ICB 18, as data was loaded into IDH 26 in the same order and must be called up from IDH 26 in the correct order. To implement the ordering, IDOQ 24 utilizes a move pointer 90, shown in FIG. 2. As the present system is initialized or reset, move pointer 90 as well as write pointer 86 are set to an initial value of zero. When IDOQ 24 is loaded by ICB 18, write pointer 86 is incremented as earlier described. When the value of write pointer 86 is not equal to the value of move pointer 90, IDOQ 24 is signaled that there is data to be moved and thereby asserts a valid signal to IDH 26. As can be seen in FIG. 2, the compare block 93 asserts the valid signal when such values are not equal. IDOQ 24 also supplies information from the IDOQ 24 registers which are being identified by write pointer 86.

[0048] Once data has been moved from IDQ 22 to DOG 12, IDH 26 signals IDOQ 24 by asserting a data moved signal shown at 94. When this occurs, move pointer 90 is incremented to the next value. If there are no more valid entries in IDOQ 24, move pointer 90 will be equal to write pointer 86 and the valid signal will be un-asserted. In the event that there is another entry in IDOQ 24, write pointer 86 and move pointer 90 will not be equal and thus IDOQ 24 will maintain a valid condition and supply the TrID corresponding to the next IDOQ 84 register. Continuing to consider FIG. 2, IDOQ 24 is also operatively connected to inbound data completion arbiter (IDCA) 30. The operational connection described in FIG. 2 allows IDCA 30 to provide an indication that data for a particular TrID is available to be read by another transaction. For a given TrID, the data may be read only once it has been moved into DOG 12 and, if the transaction being considered was a write, once IDOQ 24 has also received an acknowledged done (ACK-DONE) indication for that TrID. The order of the TrIDs given to IDCA 30 must be the same order as the TrIDs supplied originally by ICB 18. There is a separate IDOQ 24 for each F16 bus 21, each IDOQ 24 handling data from separate PCI buses. Inbound Data Complete Arbiter 30 is responsible for looking at the data item at the head of each IDOQ 24 and determining which, if any, of those data items can be sent to the RCM 14. IDCA 30 selects from the output of the various IDOQs 24 which may be producing data simultaneously. IDCA 30 determines when each IDOQ 24 can send data to RCM 14.
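
The pointer comparison that drives the valid signal, as described in the two preceding paragraphs, can be sketched as follows. The class MoveTracker and its method names are hypothetical; the sketch also assumes the queue never fills, since equal pointers are read here as "nothing to move".

```python
QUEUE_DEPTH = 8


class MoveTracker:
    """Hypothetical model of the write pointer, move pointer and valid signal."""

    def __init__(self):
        # Both pointers are zero after initialization or reset.
        self.write_ptr = 0
        self.move_ptr = 0

    @property
    def valid(self):
        # Valid is asserted whenever the two pointers differ.
        return self.write_ptr != self.move_ptr

    def on_load(self):
        """A new operation was loaded into the queue."""
        self.write_ptr = (self.write_ptr + 1) % QUEUE_DEPTH

    def on_data_moved(self):
        """The data moved signal arrived for the entry at the move pointer."""
        if not self.valid:
            raise RuntimeError("data moved signaled with nothing pending")
        self.move_ptr = (self.move_ptr + 1) % QUEUE_DEPTH


tracker = MoveTracker()
states = [tracker.valid]      # nothing loaded yet
tracker.on_load()
tracker.on_load()
states.append(tracker.valid)  # two entries pending
tracker.on_data_moved()
states.append(tracker.valid)  # one entry still pending
tracker.on_data_moved()
states.append(tracker.valid)  # pointers equal again
```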

[0049] Continuing to consider FIG. 2, available pointer 96 always points to the top available position of the queue. Once pointer 96 is incremented past a particular data entry in the queue, that entry is not considered valid. IDOQ 24 supplies IDCA 30 with information regarding the value at the top of the queue. Such value is the value entered in the IDOQ 24 register which is currently selected by available pointer 96. IDOQ 24 provides the TrID and such TrID's ACK DONE bit as described above, and determines whether the data has been moved by considering the move signal output 91 from compare block 92 shown in FIG. 2. The IDCA 30 will then assert a signal to RCM 14 (also shown as a connection between IDCA 30 and RCM 14 in FIG. 1) for a given TrID once that TrID is supplied by IDOQ 24 and when both the ACK and moved signals are asserted. If the operation is a read or an interrupt, as opposed to a write operation, the ACK DONE bit will automatically be set so that IDCA 30 will only wait for the data to be moved for that operation.

[0050] Once the ACK and moved signal 91 are both asserted for a given TrID, IDCA 30 signals RCM 14 that another transaction can read the corresponding data. Also at this time, IDCA 30 signals that available pointer 96 should be incremented by transmission of the increment signal 98 shown in FIG. 2. Signal 98 increments the available pointer 96 to the next entry available in IDOQ 24. Move signal 91 is asserted if available pointer 96 is not equal to move pointer 90. It can be appreciated that compare block 92 produces move signal 91 if available pointer 96 is not equal to pointer 90.

[0051] When move signal 91 is issued, data for the TrID at the top of the IDOQ 24 to which available pointer 96 has incremented has been moved into DOG 12. However, the affected TrID has not been given to RCM 14 through IDCA 30 at that time. Once IDCA 30 has given the TrID to RCM 14, available pointer 96 is incremented and that operation is no longer considered in IDOQ 24.
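
Taken together, the write, move and available pointers and the ACK DONE bit define when the entry at the head of the queue may be handed to RCM 14. The following Python sketch models that interplay under simplifying assumptions: the class InboundOrderQueue and its methods are hypothetical, arbitration between several queues is ignored, and the queue is assumed never to fill.

```python
QUEUE_DEPTH = 8


class InboundOrderQueue:
    """Hypothetical software model of one inbound data order queue."""

    def __init__(self):
        self.regs = [None] * QUEUE_DEPTH
        self.write_ptr = 0
        self.move_ptr = 0
        self.avail_ptr = 0

    def load(self, trid, is_write):
        # Reads and interrupts need no acknowledgments, so their
        # ACK DONE bit is preset; writes must wait for the ACK.
        self.regs[self.write_ptr] = {"trid": trid, "ack_done": not is_write}
        self.write_ptr = (self.write_ptr + 1) % QUEUE_DEPTH

    def data_moved(self):
        self.move_ptr = (self.move_ptr + 1) % QUEUE_DEPTH

    def ack(self, trid):
        for reg in self.regs:
            if reg is not None and reg["trid"] == trid:
                reg["ack_done"] = True

    def release_ready(self):
        """Release head entries that are both moved and acknowledged."""
        released = []
        # The moved condition holds while avail_ptr differs from move_ptr.
        while self.avail_ptr != self.move_ptr:
            head = self.regs[self.avail_ptr]
            if not head["ack_done"]:
                break  # head not acknowledged: everything behind it waits
            released.append(head["trid"])
            self.regs[self.avail_ptr] = None
            self.avail_ptr = (self.avail_ptr + 1) % QUEUE_DEPTH
        return released


queue = InboundOrderQueue()
queue.load(1, is_write=True)
queue.load(2, is_write=True)
queue.data_moved()
queue.data_moved()
queue.ack(2)                    # the second write is acknowledged first
first = queue.release_ready()   # nothing can be released yet
queue.ack(1)
second = queue.release_ready()  # both released, in issue order
```

The second write is fully processed before the first is acknowledged, yet it is released only after the first, which is the withholding behavior the description relies on.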

[0052] It should be noted that in the event that all of the registers in IDOQ 24 are full (the IDOQ 24 in the preferred embodiment having 8 registers), and if all such registers have valid TrIDs, IDOQ 24 asserts a PCI full signal 97 shown in FIG. 2. In this condition, signal 97 indicates to ICB 18 that the IDOQ 24 cannot handle any more requests and therefore ICB 18 must not issue any more operations to IDOQ 24 until a register is available. For inbound writes, once the TrID is entered into IDOQ 24 and the data is moved into IDQ 22, control agent 66 issues a response back to PCI input/output interface 80 that the PCI device can send more operations even though the previous writes are not complete.

[0053] Turning now to FIG. 3, the inbound data completion arbiter 30 will be described. When the transaction is both Valid and Acknowledged 34, then it enters arbitration for signaling to the RCM 14. The winner of this arbitration process is signaled to the RCM 14, and the corresponding IDOQ 24 is notified that the transaction has completed. The last PCI register indicates which of the IDOQs 24 has most recently won arbitration in the above process, and is used by Round Robin Arbiter 38.
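
A round-robin selection that starts from the queue after the most recent winner might look like the Python sketch below. The exact policy of Round Robin Arbiter 38 is not spelled out in this description, so the function round_robin_pick, its arguments, and the choice of four queues are assumptions made for illustration.

```python
def round_robin_pick(ready, last_winner):
    """Pick the next ready queue after last_winner, wrapping around.

    ready: one boolean per order queue (True when its head entry is
           both valid and acknowledged).
    last_winner: index of the most recent winner, or None after reset.
    Returns the winning index, or None if no queue is ready.
    """
    count = len(ready)
    start = 0 if last_winner is None else (last_winner + 1) % count
    for offset in range(count):
        index = (start + offset) % count
        if ready[index]:
            return index
    return None


# Four queues, one per F16 bus; queue 1 has nothing ready.
ready = [True, False, True, True]
winners = []
last = 0  # assume queue 0 won most recently
for _ in range(3):
    last = round_robin_pick(ready, last)
    winners.append(last)
```

Starting the search just past the last winner keeps any one bus from monopolizing the path to the RCM when several queues are ready at once.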

[0054] Inbound Data Queue 22 is a memory that stores the inbound data and byte enables from either a write request or a read completion from F16 interface block 20. IDQ 22 is physically comprised of memory storing the byte enables, and two memories storing the data associated with a 128 bit line of data with a 16 bit Error Correction Code (ECC) element attached. It should be understood that when reference is made to a 128 bit line of data, a 128 bit data word with 16 bit ECC is included. Data is written to the IDQ 22 one data word of 64 bits plus 8 bits of ECC at a time and is properly aligned with the 128 bit line of data by F16 interface block 20. IDQ 22 is protected by ECC for the data and by parity for the byte enables. The ECC is generated by F16 interface block 20 and checked by the DOG 12 while the parity of the data is checked locally.

[0055] Turning to FIG. 4 (presented in three parts as FIGS. 4A, 4B and 4C for clarity but representing one diagram), an Inbound Data Timing Diagram for the IDB 10 is shown. FIG. 4 illustrates the timing sequence of both a partial and a full cache line write request for the entire Inbound Data Block 10. IDOQ 24 maintains the order of the data transfers. The transactions have the invalidates collected in the order two, one, and four, shown at ACK TrID 54, but the original data order is maintained, as shown in the RCM moved TrID timing line 55 in FIG. 4. RCM 14 is not notified of transaction two's data movement until transaction one has had the invalidates collected. Transaction three in FIG. 4 was a read transaction and therefore does not require an acknowledgment for signaling the RCM 14. Finally, transaction four in the FIG. 4 Timing Diagram is not signaled to the RCM 14 until the data has been moved, since the acknowledgment was signaled a few clock cycles before it started to transfer.
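
The ordering rule illustrated by FIG. 4 can be approximated by the following Python sketch. It is offered only as an illustration: the class and method names are hypothetical, and the exact interleaving of events in the usage example is an assumption rather than a transcription of the timing diagram. What it takes from the description is that acknowledgments may arrive out of order (two, one, four), that a read needs no acknowledgment, and that the RCM is signaled strictly in the original issue order, and only once the data has been moved.

```python
class OrderedCompletion:
    """Signals transactions in issue order once each one is ready (illustrative model)."""

    def __init__(self):
        self.pending = {}    # TrID -> state; dicts preserve insertion (issue) order
        self.signaled = []   # order in which the RCM would be notified

    def issue(self, trid, is_write):
        self.pending[trid] = {"needs_ack": is_write, "acked": False, "moved": False}

    def ack(self, trid):
        # All invalidates for this transaction have been collected.
        self.pending[trid]["acked"] = True
        self._drain()

    def data_moved(self, trid):
        self.pending[trid]["moved"] = True
        self._drain()

    def _drain(self):
        # Only the oldest pending transaction may be signaled; stop at the first
        # one that is not ready so that younger transactions cannot pass it.
        while self.pending:
            trid, state = next(iter(self.pending.items()))
            ready = state["moved"] and (state["acked"] or not state["needs_ack"])
            if not ready:
                break
            self.signaled.append(trid)
            del self.pending[trid]


tracker = OrderedCompletion()
tracker.issue(1, is_write=True)
tracker.issue(2, is_write=True)
tracker.issue(3, is_write=False)   # read: no acknowledgment required
tracker.issue(4, is_write=True)

tracker.data_moved(1)
tracker.data_moved(2)
tracker.ack(2)
print(tracker.signaled)   # []: transaction two waits for transaction one
tracker.ack(1)
print(tracker.signaled)   # [1, 2]
tracker.data_moved(3)
tracker.ack(4)            # acknowledged before its data has moved
tracker.data_moved(4)
print(tracker.signaled)   # [1, 2, 3, 4]
```

Holding back only the notification, rather than the processing, is what lets later writes proceed nearly to completion while an earlier write is still collecting its invalidates.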

Advantages

[0056] The preferred embodiment improves the logical sequencing of data writes of PCI devices in a multiprocessor system having a plurality of memory systems, each memory system associated with at least one processor but using a common data cache system and control system for all of the processors. The method described provides for overlapping data write processing so that processing of subsequent write commands issued by a PCI device can begin prior to the completion of previous write commands issued by a PCI device without corrupting the original transaction order required to maintain fidelity of the transaction timing. This overlapping results in increased system performance by increasing the number of transactions that can be processed in a given period.

Alternatives

[0057] The invention can be employed in any multiprocessor system that utilizes a central control agent for a group of microprocessors, although it is most beneficial when used in conjunction with a tagging and address crossbar system along with a data crossbar system which attaches multiple groups of processors employing non-uniform memory access and divided or distributed memory across the system.

[0058] The particular system and method which allow parallel processing of sequentially issued PCI device write commands through the device bus in a multiprocessor system, as shown and described in detail, are fully capable of obtaining the objectives of the invention. However, it should be understood that the described embodiment is merely an example of the present invention, and as such, is representative of subject matter which is broadly contemplated by the present invention.

[0059] For example, the present invention is disclosed in the context of a particular system which utilizes 16 processors, comprised of four separate groups of four with each group of four assigned to a memory control agent which interfaces the PCI devices, memory boards allocated to the group of four processors, and for which the present invention functions to communicate through other subsystems to like controllers in the other three groups of four disclosed. Nevertheless, the present invention may be used with any system having multiple processors, with separate memory control agents assigned to control each separate group of multiprocessors when each group of processors requires coherence or coordination in handling data read or write commands for multiple peripheral devices utilizing various interface protocols for sequentially issued data writes from other device standards, such as ISA, EISA or AGP peripherals.

[0060] The system is not necessarily limited to the specific numbers of processors or the array of processors disclosed, but may be used in similar system designs using interconnected memory control systems with tagging, address crossbar and data crossbar systems to communicate between the controllers to implement the present invention. Accordingly, the scope of the present invention fully encompasses other embodiments which may become apparent to those skilled in the art, and is to be limited only by the claims which follow.

We claim:
 1. A method for controlling sequencing of data writes from peripheral devices in a multiprocessor computer system having a plurality of memory systems, each processor group being operatively interconnected to each other processor group, the method including the steps of: queuing a first data write issued by a peripheral device of the system; determining whether said first data write is complete; tracking the sequence order of the first and second data writes; processing said second write data substantially simultaneously with processing of said first write data, said processing of the write data using one or more of the memory systems; and outputting the processed first write data, and then outputting the processed second write data only upon completion of said first data write.
 2. The method of claim 1, wherein outputting the processed write data comprises outputting the second write data only after receiving all acknowledges of invalidate signals from the first and second data writes.
 3. The method of claim 1, wherein third write data of a third data write is processed substantially simultaneously with the first and second write data.
 4. The method of claim 3, wherein outputting the processed write data comprises outputting the second write data only after receiving all acknowledges of invalidate signals from the first and second data writes and outputting the third write data only after receiving all invalidate signals from the first, second and third data writes.
 5. The method of claim 1, wherein the memory system provides non-uniform memory access between the groups.
 6. The method of claim 1, wherein said multiprocessor system includes a common data cache system utilized for all the processors.
 7. The method of claim 1, wherein said groups are interconnected by at least one crossbar.
 8. The method of claim 6, wherein said groups are interconnected by a tag and address crossbar and by a data crossbar.
 9. The method of claim 1, wherein said groups are interconnected through a central hardware device.
 10. A computer system comprising: first and second interconnected groups of one or more processors each; a peripheral device associated with one of the groups and capable of initiating first and second data writes producing first and second write data, respectively; a queue capable of sequentially ordering the data writes; a completion indicator of the first data write, said indicator being responsive to the write data; a sequencer responsive to the write data and capable of tracking overlapping of the write data; storage for the first and second write data; and output for the first and second write data responsive to the sequencer and the completion indicator, wherein the storage for the second write data is capable of accepting the second write data before completion of the first data write; and wherein the output for the second write data is capable of outputting the second write data only after completion of the first data write.
 11. The system of claim 10, further comprising a common data cache system utilized for all the processors.
 12. The system of claim 10, further comprising one or more crossbars interconnecting the groups.
 13. The system of claim 12, wherein the one or more crossbars comprise a tag and address crossbar and a data crossbar.
 14. The system of claim 10, further comprising a central hardware device interconnecting the groups.
 15. Apparatus for maintaining data ordering while substantially simultaneously processing data issued from at least one peripheral device which issues data transactions associated with multiple processor systems utilizing at least two processors associated with a memory system comprising: memory control operatively connected to said processors, said memory system and said peripheral device; tag tracking control for data read and write information issued by said peripheral device; address crossbar connected between at least two of said memory controls; data crossbar connected between at least two of said memory controls; transaction sequencer responsive to transactions issued by said peripheral device; transaction completion store for said transactions; memory allocator responsive to said transaction completion store; data write preventer controlling a second memory system remote from the memory control associated with the peripheral device to prevent issuance of a subsequent data write before a previous data write is completed; and transaction output responsive to the transaction sequencer and the transaction completion store.
 16. Apparatus for maintaining inbound data ordering while substantially simultaneously processing data issued from at least one peripheral computer device which issues data transactions associated with utilizing non-uniform memory access with a plurality of memory systems interconnected with multiple processor systems utilizing at least two processors associated with said computer memory system comprising: peripheral computer device interface means for receiving data writes in the form of transactions from a peripheral computer device, comprising: transaction completion means for determining whether a given data write transaction from said peripheral computer device is complete; transaction management means for tracking the completion state and the memory location of a given data transaction; data organizer means for storage and queuing of each data transaction order including inbound data queuing means and inbound data handler means wherein said inbound data order queuing means tracks the order of each transaction to maintain the sequence of each transaction with the order said transaction was issued from the said computer peripheral device and said inbound data handler means stores each said transaction being tracked by said inbound data queuing means; wherein at least two write data transactions from each peripheral computer device issued sequentially may be processed substantially simultaneously by the system, and each said data transaction is outputted in the same sequence as issued by said peripheral computer device.
 17. A system for controlling sequencing of data writes from peripheral computer devices in a multiprocessor system utilizing non-uniform memory access with a plurality of memory systems, each memory system associated with at least one processor group and a common data cache system utilized for all the microprocessors, and each processor group is operatively interconnected to each other processor group by a tag and address crossbar, and a data crossbar, said system providing for overlapping data write processing to begin processing of subsequent data writes prior to completion of previous data writes, comprising: a memory control system for each of said processor groups, each said memory control system being interconnected through said tag and address crossbar, and a data crossbar; memory coupled to each memory control system; and a plurality of the sub-systems comprising data processors, the plurality of said data processors storing and forwarding in said memory multiple sequentially issued write data transactions, from a peripheral computer device, and a respective set of transaction identification tags, including one tag for each data transaction stored by said memory; each of the plurality of data processors being coupled to the memory control system, for sending memory transaction requests to the memory control system; the interface for each of the data processors that has a memory for receiving transaction requests from the memory control system corresponding to memory transaction requests by other ones of the data processors; each memory transaction request having an associated address value and an order of issue value; the memory control system including: transaction handling means for activating each memory transaction request when it meets predefined activation criteria, and for holding each memory transaction request until the predefined activation criteria are met; wherein the predefined activation criteria include a transaction ordering conflict criterion that is a function of the address value of each transaction and the order which each transaction is issued by said peripheral computer device associated with the memory transaction request and the address value of activated memory transaction requests; a transaction management means that stores active transaction status data representing memory transaction requests which have been activated by inbound data handling means, the active transaction status data including data for each activated transaction representing an address value associated with the transaction; the active transaction status data including data representing memory transaction requests received from the plurality of data processors; and memory transaction request means for processing the memory transaction request after it has been activated by the transaction activation means; the transaction management means including request completion means for comparing one not-yet-complete memory transaction request with the stored active transaction status data for all activated memory transaction requests so as to determine whether activation of the each memory transaction request would violate the predefined activation criteria with respect to any of the system memory transaction requests; wherein the transaction management means holds all transaction requests by any of the data processors that violate the predefined activation criteria with respect to any memory transaction that has already been activated.
 18. The system of claim 17, wherein said predefined activation criteria is determining whether all previously issued write transactions have been completed.
 19. Apparatus for maintaining data ordering while substantially simultaneously processing data issued from at least one peripheral computer device which issues data transactions associated with multiple processor systems utilizing at least two processors associated with a computer memory system comprising: memory control means operatively connected to each said processors, memory system and peripheral computer devices; tagging means for tracking data read and write information issued by said peripheral computer device; address crossbar means for interfacing between at least two of said memory control means; data crossbar means for interfacing between at least two of said memory control means; means to track the timing sequence of a first transaction issued by any of said peripheral computer devices; means to receive and track the sequence of a subsequently issued transaction from said peripheral computer device; means to determine and store information from each peripheral computer device transaction relative to the state of completion of the data contained therein; means for comparing the state of completion of the data contained in said first transaction and said subsequently issued transaction and allocating space in said memory system; means for preventing the memory system of a memory control remote from the memory control means associated with the peripheral computer device issuing the transactions from issuing a subsequent data write before a previous data write is completed; and means for outputting of said first and said subsequent transaction, sequenced in the same order as the original order of said first and said subsequent transaction.